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Abstract: Parameter estimation for and prediction of spatially or spatio-temporally correlated 
random processes are used in many areas and often require the solution of a large linear system 
based on the covariance matrix of the observations. In recent years, the dataset sizes to which 
these methods are applied have steadily increased such that straightforward statistical tools are 
computationally too expensive to be used. In the univariate context, tapering, i.e., creating 
sparse approximate linear systems, has been shown to be an efficient tool in both the estima¬ 
tion and prediction settings. The asymptotic properties are derived under an inhll asymptotic 
setting. In this paper we use a domain increasing framework for estimation and prediction us¬ 
ing multivariate tapering. Under this asymptotic regime we prove that tapering (one-tapered 
form) preserves the consistency of the untapered maximum likelihood estimator and show that 
tapering has asymptotically the same mean squared prediction error as using the corresponding 
untapered predictor. The theoretical results are illustrated with simulations. 

Keywords: one-taper likelihood; Gaussian random field; domain increasing; sparse matrix. 


1 Introduction 


Parameter estimation for and smoothing or interpolation of spatially or spatio-temporally cor¬ 
related random processes are used in many areas and often require the solution of a large linear 
system based on the covariance matrix of the observations. In recent years, the dataset sizes 
to which these methods are applied have steadily increased such that straightforward statistical 
tools are computationally too expensive to be used. For example, a typical Landsat 7 satellite 
image consists of more than 34 million pixels (30 m resolution for an approximate scene size of 
170kmxl83km; source landsat.usgs.gov). Hence, classical spatial and spatio-temporal models 
for such data sizes cannot be handled with typical soft- and hardware. Thus, one typically 
relies on approximation approaches. In the univariate context, tapering, i.e. creating sparse ap¬ 
proximate linear systems through a direct product of the (presumed) covariance function and a 
positive definite but compactly supported correlation function, has been shown to be an efficient 
tool in both the estimation and prediction settings. 

The vast majority of the theoretical work on univariate tapering has been placed in an 
infill-asymptotic setting using the concept of Gaussian equivalent measures and mis-specihed 


covariance functions set forth in a series of papers by M. Stein (1988; 1990; 1997; 1999). Subse¬ 


quently, Furrer et al. (2006); Kaufman et al. (2008); Du et al. (2009) and Wang and Loh (2011) 


have assumed a second-order stationary and isotropic Matern covariance to show asymptotic 


optimality for prediction, consistency, and asymptotic efficiency for estimation. Recently, Stein 
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(2013) has extended these results to other covariance functions by placing appropriate conditions 
on the spectral density of the covariance. 

In the infill-asymptotic setting, it is (essentially) sufficient to match the degree of differ¬ 
entiability at the origin of an appropriately chosen taper function with the smoothness of the 
(Matern) covariance at the origin. Loosely speaking, for prediction, the predictor based on ta¬ 
pered covariances has the same convergence rate as the optimal predictor and the naive formula 
for the prediction kriging variance has the correct convergence rate as well (Theorem 2.1 of 



likelihood equations. In a one-taper setting only the covariance is tapered while for two-tapered 
both the covariance and empirical covariance are affected. The one-taper equation results in 
biased estimates while the two-taper equation is an estimating equation approach and is thus 
unbiased. The price of unbiased estimates is a (severe) loss of the computational efficiency 


intended through tapering (see, e.g.. Table 2 of Kaufman et al.. 2008 or Figure 2 of Shaby and 


Ruppert, 2012). 


Extending the idea of tapering to a multivariate setting is not straightforward. The infill- 
asymptotic setting does not allow one to ‘embed’ the multivariate framework in a univariate 
one (e.g., as in Sain et al., 2011| for Gaussian Markov random fields). Ruiz-Medina and Porcu 


(2015) introduced the concept of multivariate Gaussian equivalent measures, but the conditions 
are difficult to verify and their practical applicability is not entirely convincing. Several authors 
have recently approached the problem using a increasing-domain setting (jShaby and Ruppert| 


2012; Bevilacqua et al., 2015). The main advantage of this alternative sampling scheme is that 


we are not bound to Matern type covariance functions nor to tapers that satisfy the taper condi¬ 
tion (i.e., sufficiently differentiable at the origin and at the taper length). More so, we will show 
that for collocated data, other practical tapers can be described. The main disadvantage is the 
somewhat less-intuitive conceptual framework. For example, in the case of heavy metal contents 
in sediments of a lake, infill-asymptotics can be mimicked by taking more and more measure¬ 
ments. In a increasing-domain setting, this is not possible. On the other hand asymptotics is a 
theoretical concept and in practice only a finite number of observations are available. 

The main contributions of this paper are as follows: (i) under weak conditions on the covari¬ 
ance matrix function and the taper (matrix) function form we show that in a increasing-domain 
framework the tapered maximum likelihood estimator preserves the consistency of the unta¬ 
pered likelihood estimator; (ii) the difference between the (integrated) mean squared prediction 
error of the tapered and the untapered converges in probability to zero, even when prediction is 
based on estimated parameters. Note that although we require that the taper range increases, 
no rate assumption is necessary; (iii) numerical simulations illustrate that the approach has very 
appealing finite sample properties, especially for prediction with plugin estimates we find only 
a very small loss in efficiency. 

This paper is structured as follows: Section introduces basic notation and relevant defini¬ 
tions. The main results are given in Section Section illustrates the methodology using an 
extensive simulation study. Concluding remarks are given in Section Proofs and technical 
results are presented in the appendix. 

Note that compared with directly using compactly supported covariance functions, tapering 
has several advantages. Our modeling experience has shown that the (practical) dependence 
structure is often larger or much larger than what can be handled computationally and additional 
approximations would be needed anyway. We see tapering as a computational approximation 
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that does not alter the statistical model. The taper range (degree of tapering) depends on the 
availability of memory and computing power and thus changes when the analysis is carried out 
on different computers or at some later time with improved hardware. 


2 Notation and setting 

We denote (deterministic) vectors and matrices with bold lower and upper case symbols. Ran¬ 
dom variables and processes are denoted with upper case symbols and random vectors and vector 
processes are denoted with bold upper case symbols. For x G M™, we let |a;| = maxj=i^,..^m 

and IIa:II = ^JYT=lx‘i■ 

The singular values of a n x n real matrix A = {aij) are denoted by pi{A) > • • • > Pn{A) > 0 
and, in the case when A is symmetric, the eigenvalues are denoted by Ai(A) > • • • > An(A). 
The spectral norm is given by pi (A) and ||A|||, = j |nijP denotes the Frobenius norm. 

For a sequence of random variables A„, we write = Op{l) when A„ converges to 0 in 
probability as n —>■ oo and we write A„ = Op(l) when Xn is bounded in probability as n —>■ oo. 


Let, for d G N"'' and p G N"'', hxed throughout this paper. 


{Zkis) ; s G C M"*, /c = 1,... ,p} 


( 1 ) 


be a multivariate stationary Gaussian random process. We let Z(s) = (Zi(s),..., Zp{s)Y. To 
simplify the notations, we assume, essentially without loss of generality, that: 


Condition 1. Process Q has zero mean. 

Let q G N"*" and let 0 be the compact subset [0inf,^sup]^ with — oo < 9\a_i < 9sup < +oo. For 
each 0 G 0 we consider a candidate stationary matrix covariance function for the process (Q, 
of the form C(/i; 0) = {cki{h] 6)). We assume that there exists Oq G 0, with for i = 1,... ,q, 
9mf < 9oi < ^sup) so that C(/i; Oq) = Cov (Z(s), Z{s + h)). The covariance function Ckkih; Oq) of 
the A:th (marginal) process is called a direct covariance (function) and the off-diagonal elements 
Cki{h\OQ), k ^ I, are called cross covariance (functions). We also consider a stationary taper 
matrix function of the form {tki{h)), with tki{h) = 0 for \\h\\ > 1 . 

For any n G N"*", the Gaussian processes ([^ are observed at the points xi,... ,Xn G 

Condition 2. We dispose collocated observations at the distinct locations xi,... ,Xn G M'^. 

For i = {k — l)n + a and j = {I — l)n -|- b, with k,l = 1,... ,p and a,b = 1,..., n, we 
let 2 : be the np x 1 Gaussian vector with Zi = Zk{xa), for 0 G 0 we let Sg be the np x np 
covariance matrix with aoij = Cki{xa — Xb;0) and T be the np x np taper covariance matrix 
with tij = tki{{xa — Xb)/^n), where 7 ^ > 0 is the taper range. We let Kg = "Eg o T, where the 
symbol o denotes the direct (Schur) product. 

The maximum likelihood (ML) estimator is defined by 0 ml £ sxgm.va.gLg, with 


Lg 


1 

np 


log (det {Eg)) -h —z^Eg^z. 
np 


( 2 ) 


The tapered ML estimator is defined by OtMh £ argmin^ Lg , with 

Lg = — log (det (Ke)) -h —z'^K.g^z. 
np np 


( 3 ) 
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We can assume, without loss of generality, that Zi{x) is the Gaussian process that is pre¬ 
dicted at new points. Then, for x G W^, let crg(x) be the np x 1 vector dehned by, for 
i = {k — l)n + a, k = 1,... ,p, a = 1, ..., n, ae{x)i = cik{x — Xa, 0). Define similarly the 
npxl vector by ke{x)i = cik{x - Xa]6)tik{{x - Xa)hn)- 


3 Consistent estimation and asymptotically equal prediction 


We first explore four conditions on covariance and taper matrix functions. The following condi¬ 
tion holds for all the most classical models of covariance functions with infinite supports. Note 
that models with compactly supported covariance functions can be non-differentiable with re¬ 
spect to the covariance parameters, but that tapering is irrelevant anyway in increasing-domain 
asymptotics when the original covariance functions are already compactly supported. 


Condition 3. For all fixed x G k,l = I,... ,p, Cki{x]0) is continuously differentiable with 
respect to 6 . There exist constants A < -l-oo and a > 0 so that for all i = 1,... ,q, for all a; G 
and for all 0 G 0, 


\cki ix;9)\ < 


1 -h \ x\^+°‘ 


and 




< 


A 


1 + 


Condition 4. For all k, I = l,...,p, the taper function t^i is continuous at 0 and satisfies 
tki{0) = 1 and \tki{x)\ < 1 for all x G M'^. The taper range 7 = 7 n satisfies jn -^n -^00 + 00 . 


The next condition on a minimal distance between two different observation points is assumed 
in most domain increasing settings. 

Condition 5. There exists a constant A > 0 so that for all n G N"*" and for all a b, 

I Xq^ x^j I ^ a . 

Condition 6 . There exists a constant 6 > 0 so that for all n G N"*" and for all 0 G 0, 
A d and A d. 

We expect Condition to hold in many cases when Condition also holds. For univariate 
tapering, Condition would indeed hold under mild assumptions (consider an adaptation of 
Proposition D.4 in |Bach^ 2014b). Furthermore, when the parametric model incorporates a 
nugget effect or measurement errors, then Condition holds provided that the nugget or error 
variances are lower-bounded uniformly in 6. The nugget or measurement error case is directly 
treated by Theorem Theorem would also be valid for it with a minor change of notation to 


define the integrated prediction errors (see, e.g., the context of Bachoc, 2014a). 

The next theorem and corollary (the corollary is proved using standard M-estimator tech¬ 
niques), show that if the standard conditions for consistency of the (untapered) ML estimator 
hold, then the tapering preserves this consistency, as long as 7 + 00 . 


Theorem 1 . Assume that Conditions and^hold. Then, 


as n 


00 , 


sup \Lg - Lei = Op(l)- 
6»e0 

Corollary 2 . Consider the same setting as in Theorem Assume that for all n > 0 there 
exists e > 0 so that 

inf L 0 - Le > eOp(l), 

\e-eo\>K 
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where the Op(l) may depend on e and k and goes to 0 in probability as n ^ oo. Then, as n ^ oo, 


e 


ML 


Oo 


and 


e 


tML 


>p Oq. 


Theorem and Corollary highlight the important difference between one-taper and two- 
taper ML in terms of asymptotics. One-taper approximation with fixed range 7 and independent 
of n boils down to an incorrectly specified covariance model. Thus, with fixed 7 , the tapered 
ML estimator would generally be inconsistent and would converge to the asymptotic minimizer 


of a Kullback-Leibler divergence (for the univariate case, see the discussion in Kaufman et al. 
2008, and also Watkins and Al-Boutiahi 1990, or Bachoc, 2014a). Hence, assuming 7 —?• 00 


is necessary to prove consistency, which we do here. Note that, nevertheless, no rate needs to 
be specified. These facts also entail an exposition benefit for our paper: we simply have to 
show that the one-taper approximation does not damage the untapered ML estimator. The 
question of the consistency of this latter estimator can be treated in separate references, like 


Mardia and Marshall (1984) or Bachoc (2014b) for the univariate case. Especially, identifiability 


assumptions for the covariance model need not be discussed in our paper. 

On the other hand, for the two-taper ML, consistency can be proved for a fixed 7 , provided 
notably that the model of tapered covariance and cross-covariance functions is identifiable. (In 
particular, two different covariance parameters yield two different sets of tapered covariance and 


cross-covariance functions.) We refer to Shaby and Ruppert (2012) for a corresponding proof in 


the univariate case. (Actually, we believe that a global identifiability condition might be missing 


m 


Shaby and Ruppert (2012), stronger than assumption (B) in this reference, for it is not clear 


how to go from (S.29) to (S.30) in its supplementary material.) Hence, the difference between 
the asymptotic analysis of the untapered and two-taper ML estimators is more pronounced, 
since the latter estimator is a quasi-likelihood estimator in a covariance model different from 
the original one. This is why, in Shaby and Ruppert (2012), many assumptions, notably on 


identifiability, are restated independently of the untapered ML estimator. 

These asymptotic considerations also correspond to practical aspects of the comparison be¬ 
tween one- and two-taper equations. The latter can be employed with a smaller range 7 than the 
former, which is beneficial, but on the other hand, requires the full inverse of a sparse matrix. 

The following theorem shows that tapering has no asymptotic effect on prediction, uniformly 
in the covariance parameter 6. (Note that for prediction, there is no distinction between one 
and two-taper approximation.) 

Theorem 3. Assume that Conditions 0013 and^hold. Let {xnew,n)n£n+ be a fixed sequence 
Then, as n 


in 


00 , 


sup 

6 »e 0 


^oi,Xnew,n) ^ •^l(®neio,n) 


1 2 


^0 (®n 




1 2 


— Op(l). (4) 


Assume furthermore that for any fixed 6, k and I, the functions Cki{x-,6) and t^fix) are 
continuous. Let Vn be a sequence of measurable subsets of with positive Lebesque measures 
and let fnix) be a sequence of continuous probability density functions on Vn. Then, as n ^ 00 , 


sup 

0 e 0 


'Vr, 


ae{x)^T^Q^z-Zi{x) fn{x)dx-f k0{x)'^Kg^z-Zi{x) fn{x)dx 


= Op(l). 


( 5 ) 


In 0, we assume continuity of the cross covariance, covariance and taper functions, and of 
fn{x) in order to define integrals in the L^ sense. When fn{x) is constant on Vn, Theorem]^ 
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shows that tapering does not damage the mean integrated square prediction error over any 
sequence of prediction domains 'Dn. Furthermore, in Q and Q, the terms in the differences are 
typically bounded away from zero in probability, because of Condition (consider for example 


Equation (10) in Proposition 5.2 of Bachoc, 2014b). (This would not hold only in degenerate 
cases when ajnew.n becomes arbitrarily close to an observation point or where fn{x) concentrates 
around an observation point.) Hence, also the ratio of (integrated) mean square prediction errors, 
between tapered and untapered predictions, converges to unity in general. Finally, because 
of the supremum over 0 in (|^ and , Theorem implies that the difference of tapered and 
untapered prediction errors goes to zero also when the predictions are obtained from any common 
estimator 6 . 

Remark: The condition tki{0) = 1 in Condition is necessary for Theorem Indeed, it 
is typically needed in order to guarantee that l/{np)\\'Eg — K 0 |||, goes to zero. The latter is 
necessary for Theoremas can be shown from the arguments in the proof of Proposition 3.1 in 
Bachoc (2014b[). The condition tki{0) = 1 should also be needed for Theorem]^ as is suggested 


by the second offline equation in Proposition 5.1 in Bachoc (2014b). 


4 Simulations and illustrations 

We now evaluate the finite sample performance of multivariate tapering with simulations. We 
consider a bivariate Gaussian isotropic process with Matern type direct and cross-covariances 


^ f( (1^1/(II^II/ Pki ), 


k,l = 1,2 


( 6 ) 


where F is the Gamma function and JCi, is the modified Bessel function of the second kind 


of order u (Abramowitz and Stegun, 1970). To ensure positive definiteness, constraints on 


Wki, Pkh ^kuk,l = 1,2} have to be imposed, see Gneiting et al. (2010). We use two different 
covariance models: 

(A) ranges: pn = 5, pi 2 = 3, p 22 = 4 

sills: an = 1, an = .6, a 22 = 1 

smoothness: un = un = 1^22 = 1/2 

(B) ranges: pn = 3, pn = 3, P 22 = 4 

sills: an = 1, an = .7, 0-22 = 1 

smoothness: vn = 3/2, z^i 2 = 1, 1^22 = 1/2 

The smoothness parameters will not be estimated and are fixed. Hence, 6 = {pn, P 12 , P 22 ,o'n, 
o' 12 W 22 )''' and q = 6. The Matern covariance functions satisfy Gonditionj^ 

We consider the following taper matrix functions: 


(i) tki{x) = (1 - 

(ii) tki{x) = (1 - 

(hi) tki{x) = (1 - 

(iv) tii(a:) = (1 - 

t22{x) = (1 - 


a:||)^(l + 4||a;||), k,l = 1,2. 

a:||)6(l + 6||a;|| +35||a:||V3), k,l = 1,2. 

a:||)^(l + ||a;||/2), A:,^ = l,2. 

a:||)+(l + 5||a;|| + ||a;|p), ^ 12 ( 2 :) = ^21 (a:) = (1 - ||®||)^(1 + 5||a;|| -)- fa:IP), 

*11)1(1+ 5||a:||). 
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Figure 1: Different taper functions. 


Taper matrix functions (i) - (iii) satisfy Condition and the associated taper matrices are of 
the form T = ll"'' (g) t{\\xa — a;fe||/7) where the symbol (g) denotes the Kronecker product and 


where t{-) is as indicated above. In the literature these functions are referred to as Wendlandi, 
Wendland 2 and spherical taper (Wendland, 1995 Furrer et a/., 2006). 

Taper matrix function (iv) is taken from Demel (2013) Corollary 2.2.3, based upon results 
from Theorem 3 of Ma (2011a) and Lemma 2 of Ma (2011b). The validity of this taper matrix 
function can also be shown using Theorem A in Daley et al. (2014) published later. Taper matrix 
function (iv) has ti2(0) = y^6/7 < 1 (see Figureand we investigate its finite sample behavior 
although Condition 1^ is violated. We expect similar behavior of (i), |(ii) and (iv) as the (direct) 
taper functions are very similar. 

We are sampling 4m^ locations uniformly in a domain defined by the union of squares 
[(1 — A)/2]^, centered at {±(r — 1/2), ±(s — 1/2)}, r,s = l,m. The parameter A represents 
the minimum distance between the locations and the case A = 1 is a regular grid. Prediction is 
done at the location Snew = (0,0)"'' in the center of the domain. Figure illustrates the setup. 
We present results for the two cases A = 0.2,1 (thus satisfying Condition]^ and three grid size 
parameter values m = 10,16, 25, i.e., n = 400,1024, 2500 and covariance matrix sizes 800 x 800, 
2048 X 2048, 5000 x 5000, respectively. Condition has been verified numerically. 

The next two subsections discuss the results of estimation and prediction. Computational 
details are given in the last subsection. 


• • • 

0 

• . 

• • 

• • • 

. • • 

• 

^0 h = l 


Figure 2: One set of sampled locations with simulation parameter A = 0.2 and square 
center spacing h = 1. 
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Taper range 


Taper range 


Figure 3: Effect of increasing the taper range 7 on the ML estimates. Columns 
are for the two different covariance models, rows are for different parameters (truth is 
indicated by the horizontal green line). 100 realizations have been generated (A = 1) 
based on n = 400. Each individual realization is indicated with a gray line. 


4.1 Estimation 


We first investigate OtuL and compare it to 60 as the taper range increases. Eigurej^ 
the estimates of OtuL for equispaced observations (A = 1) with n = 400, taper function (i 


summarizes 
and 


using taper ranges 7 = 4, 6 , 8 ,10 as well as no tapering (7 = Inf). As expected, for small taper 
ranges the results are biased with range parameters typically overestimated and sill parameters 
































Figure 4: Effect of increasing the domain on the ML estimates. The boxplots corre¬ 
spond to n = 400 (gray), 1024 (yellow), 2500 (light blue), left to right for each taper 
range, A = 1. See also Figure [3| 


underestimated. For smoother spatial fields (B), the bias and uncertainties are (slightly) larger. 
The estimates of the sill parameters benefit from a regularizing aspect of tapering and thus 
exhibit a consistently smaller variance compared with the untapered estimates. This effect of 
regularizing is surprisingly strong for model (B) and parameter an. 

Figure shows the effect of increasing the number of locations where we have added the 
boxplots for n = 1024 and n = 2500 (i.e., m = 16 and m = 25) to four panels of Figure]^ 
For the untapered estimates, one clearly sees that the uncertainties in the estimates decrease 
with increasing n. For the tapered estimates this effect is not as pronounced because of the 
regularizing effect of the tapering. As expected, the bias itself is not reduced by increasing the 
number of observations while keeping the taper range fixed. On the other hand, as illustrated 
in Corollary!^ when going from n = 400,7 = 4 to n = 2500,7 = 10, the distribution of the 
tapered ML estimates becomes closer to that of the untapered ones. 


4.2 Prediction 


In practice, prediction is often of prime interest and we primarily investigate the effect of tapering 
on the prediction of the first process Zi at the unobserved location x^ew = (0,0)"''. As parameter 
values we use Oq and OtuL for different taper ranges 7 . 

In Figure we display the ratio of the tapered to the untapered mean squared prediction 
errors (MSPEs) using Oq. For Model [(Aj| the loss of efficiency is in general of the order of a 
few percent (the 95% pointwise range is below 1.08 for 7 > 5). For smoother processes, the 
taper range needs to be increased in order to maintain the same efficiency. This is in sync 


with inhll-asymptotic results (see, e.g.. Figure 3 of Furrer et al.. 2006). There is little difference 


between the Wendlandi and Wendland 2 tapers. Overall, the former having in general a slightly 
smaller MSPE. 

The third row of Figure [^illustrates why it is prohibitive to use tapers that are linear at the 
origin. While the spherical taper has no influence on the screening effect (Stein, 2002) of the 


exponential Model [(A) (left panel) it completely breaks down for smoother helds (right panel). 
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Figure 5: Ratios of the tapered to the untapered MSPEs for n = 400 using Oq. 
The solid line represents MSPE ratios for equispaced locations (A = 1), the dashed 
line shows the median MSPE ratios from 100 simulations with random locations with 
A = 0.2 (gray and light gray are pointwise 50 and 95 percentiles). The blue lines 
indicate the number of points within the taper range (mean solid, median dashed and 
light blue pointwise 95 percentiles). 


Figure also links the taper range with the number of observations within the taper range. 
The MSPE ratios suggest that tapering with more than 100 locations within the taper range is 
hardly worth the effort. 

In Figure]^ we distinguish a small loss of efficiency when using taper function (iv) compared 
with and |(ii)[ This can be explained by the fact that the taper function (iv) does not satisfy 


Condition 1^ (as ti2(0) < 1). Nevertheless, this loss is far less pronounced than when using taper 
function I (ill) I for model (B) 


For very small taper ranges, the MSPE ratios shown in Figure seem large. However, 
presented in terms of differences, the effect of tapering is hardly noticeable. For example, for 
the setting (Ai) with n = 400, the MSPEs are 0.1155 0.1101 0.1098 for 7 = 3,11, oo, respectively 
(see also red line in the left panel of Figure]^. 

The left panel of Figure [^further shows the effect of increasing the number of locations on 
the MSPE. The effect of increasing n is negligible even for the theoretical MSPE, the values are 
visually indistinguishable. With as few as n = 400 we extract essentially all the information in 
the system. 
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Figure 6: Left: Effect of increasing n on the prediction error. Horizontal red lines give 
the theoretical MSPEs. Within each boxplot triplet for a specific taper range, left is 
for n = 400 (gray), middle for 1024 (yellow), and right for 2500 (light blue). Prediction 
is based on 0tML with A = 1 and 100 realizations of the bivariate process. Mean is 
indicated by the blue tick. Right: 100 bivariate predictions for n = 400 and A = 1. 
Red: simulated “truth”, green: no tapering, blue: tapering with different taper ranges. 


The right panel of Eigurej^shows the results of 100 bivariate predictions at the origin. There 
is again virtually no difference in the predictions using 7 = 4,6, 8 ,10 (blue dots) and no tapering 
(7 = Inf, green dot). Eor smoother fields (variable 1, |(B)[ ), the prediction error is smaller and 
thus the difference between the red and blue/green dots is much smaller than for variable 2. The 
choice of the taper matrix function has again only a marginal effect on the result (not shown). 

It has to be kept in mind that our simulation setup is the “least” favorable for the tapering 
approach. By including a nugget or reducing the spatial correlation we would receive even more 
appealing results because the importance of neighboring locations and their contribution to the 
prediction would be less important. Note also that estimation and prediction results can be 
improved by lowering A. 


4.3 Computational efficiency 


The analysis has been implemented with the freely available computer software R (Ihaka and 


Gentleman, 1996; R Development Core Team, 2015) running on a server with an Intel Xeon 


6 C E5-2640 2.50 GHz CPU (24 cores) and 256GB shared RAM (parallelization has not been 
explicitly exploited). The number of locations was kept below 2500 in order to maintain a 
reasonable computing time for the untapered settings, which require 0 {p^n^) computing time 
and 0{p^n^) storage using straightforward R commands with classical methodologies. 

The tapered settings have been implemented using sparse matrix data structures and algo¬ 
rithms. The package spam (Eurrer, 2014; Eurrer and Sain, 2010) is tailored in order to handle 


tapered covariance matrices, estimation, and prediction in the framework of Gaussian random 
fields. The core work load consists of calculating a Cholesky factorization of a permutation 
of the possibly tapered covariance matrix. The permutation (multiple minimum degree) im¬ 
proves storage and operation count; see Eurrer and Sain (2010), Liu| (1985), and Ng and Peyton 


(1993) for more technical details. Erom the Cholesky factor, it is straightforward to calculate 


the determinant as well as the quadratic term through two triangular solves. Hence, for large n, 
there is little difference in computational cost between a likelihood evaluation or a prediction. 
Exact operation counts are difficult to determine but the algorithms are virtually 0{pnh?‘) for 
operation count and 0 {pnh) for storage, where h is the “typical” number of observations within 
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the taper range. 

For estimation, depending on the exact implementation, many likelihood evaluations are 
necessary. Using resonable starting values, the R function optim required on average between 
100 to 250 function evaluations depending on taper range and model (n = 400). In the unta¬ 
pered case, the average was typically somewhat lower. To reduce convergence issues, we started 
estimating the untapered version using the true parameter values as starting values and sub¬ 
sequently decreased the taper range using the previous optimum as starting values. Because 
of the large size of the datasets, no convergence issues were encountered and no sample was 
“manually” treated or eliminated. 


5 Discussion and outlook 


Similarly to the univariate case, multivariate tapering is a very effective approximation approach 
for prediction and for estimation of spatially correlated random processes. The small loss in 
prediction efficiency is recouped by the computational gains for reasonably large data sizes. For 
very large datasets, approximations have to be included and tapering is the method of choice 
as the computational implementation is straightforward. Compared with other approximation 


approaches (low-rank models, e.g., Cressie and Johannesson 2008 Banerjee et al.. 2008 Stein 


2008 composite likelihood approaches, e.g., Stein et al. 2004 Bevilacqua et al. 2012 [ETdsvik 


et al., 2014, Gaussian Markov random fields type approximations, e.g., Hartman and Hossjer| 


2008; Lindgren et al.. 2011, etc) tapering is the most accessible and most scalable approach. 


Tapering is especially powerful for prediction. Even for very small tapers we have a MSPE 
that is almost identical to the MSPE for the untapered setting. However, we are substantially 
faster as a single prediction is roughly 20 and 100 times faster compared with a classical approach 
(for n = 2500 and n = 10000 using 7 = 5 ). One likelihood evaluation is similarly computing 
intensive as a single prediction and thus the same advantages hold for estimation. If the ultimate 
goal is prediction, we advocate the use of the one-taper ML plugin estimates. The two-taper 
approach is computationally self-defeating and should only be used if unbiased estimates are 
absolutely necessary. 

In the case where the different variables have a similar density of locations, we propose to 
use the same taper function for all direct and cross covariances. Compared with the taper 
range, the exact form of the taper plays a secondary role. Hence for different location sampling 
densities, possibly non-stationary, we foresee adaptive tapers as outlined by Anderes et al. (2013) 
or 


Bevilacqua et al. (2015) as a valuable alternative. 


Eor estimation, the standard optimization routines of R (optim and its derivatives) require a 
substantial amount of time. We are currently experimenting with a simple grid search algorithm 
that would approximate the ML estimate sufficiently well. Based on the simulation results in the 
last section, if prediction based on plugin estimates is of interest, the approximation is sufficient. 

While the uncertainty of the ML estimates can be harnessed through the Hessian (by product 
of the optim routine) sufficiently well, deriving uncertainty estimates for an entire prediction 
field remains a bottleneck, as accordingly many linear systems have to be solved. 
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Appendix 

Proof of the theorems 

Proof of Theorem Because 0 is compact and because of Lemma it is sufficient to show 
that, for any fixed 0, Lg — Lg = Op(l). Hence, let an arbitrary 6 be fixed. We have 


io-io = ;;^>oe(det[E.Kj‘]) + PzyE^‘-K^‘)z 

/ LU ILIJ 


= T 1 +T 2 . 

We treat Ti and T 2 separately. First 


(7) 


1 


np 


Ti = — Vlog (Aj 
np V 


l/2y, 1/2 

Kg l^gKg 


The Aj(-) above are between two constants 0 < H and B < +00 uniformly in i and n because of 
Condition and Lemma Thus, there exists a finite constant C so that for any i,n 


log (A 




- 1/2 


< c 


1-A,: 


l/2yi 1/2 

Kg l^gKg 


Thus 


C 


np 


|Ti| < -^1-A, 

np I 

Z=1 




Q 

(Cauchy-Schwarz:) < — 

np 


np 



1/2-y, 1/2 

i=i 



= C, 


np 


tr ( n-Kg^/'^T.gK 




= C 


i 


Ttr| 


= C./- 

y np 


Kg~^[Kg-j:g\ Kg~^ 


Now, because of Condition 
we have 


/9 i(Kq is bounded uniformly in n by a finite constant D. Hence 


1 


|Ti| <CD\ —\\Kg-^g\\^p, 


which goes to 0 as n —)• 00 because of Lemma Next, turning to T 2 in Q, 
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Hence, interpreting tr(AB) as a scalar product between A and B"*", we obtain by the Cauchy- 
Schwarz inequality 


1 

np' 


|E(r2)| < W£-||S-^S0„K-^|||Wf-||K« - ^eWl- 


1 

np' 


In the above display, the first square root is bounded because of Condition and of Lemma 
The second square root goes to 0 because of Lemma 10 Hence E(T 2 ) —0. Furthermore 

In the above display, the pi(-) are bounded because of Condition and Lemma Thus 
Var(T 2 ) -^n-^oo 0. So T 2 = Op{l) which finishes the proof. □ 

Proof of Theorem We only prove (§, the proof of Q being similar and technically simpler. 
Using a? — = [a + b){a — b) followed by the Cauchy-Schwarz inequality, we obtain 


sup 

e 


IVn L 


1 2 


aeixy'Eg^z - Zi{x) fn{x)dx- k 0 {x)'^Kg^z - Zi{x) fn{x)dx 


I Tin 


< / sup 
Jv„ e 


a 0 {x)'^Y:Q^z + k 0 {x)'^K„^z - 2Zi{x) crg{x)'^Y:g^z - kg{x)'^Kg^z ) fn{x)dx 


< 


f sup {aeixyEg^z + keix^Kg^z - 2 Zi{x)Y fn{x)dx 

Jv„ e 


J / sup [cT 0 {xYT,g^z - ke{xY¥.g^z)^ fn{x)dx 

V J'Dr, 0 

= ( 8 ) 

We show separately that Ui = Op(l) and U 2 = Op{l). For C/i, 

Ui < 3 [ sup fn{x)dx+3[ sup fn{x)dx 

I'D 9 ^ ' l-n f) \ / 


9 


+12 / sup(Zi(a;)) /„(a;)da::. 

Jv„ 9 

The last random integral in the above display has constant mean value 12cii(O;0o) so it is 
bounded in probability. We address the two remaining random integrals in the same way, and 
give the details for the first one only. Using a version of Sobolev embedding theorem (Theorem 
4.12, Part I, Case A in Adams and Fournier, 2003), there exists a finite constant Aq depending 
only on 0 so that 


sup (cr 0 (a;)’'’S„^ 2 :) < Ae 


'0 


(Tg{x)'^'Eg^Z 


9+1 


r 

de + AsT 

i=l 


JL 

m 


o'eixy'Sg^z 


9+1 


Hence, using Fubini theorem for non-negative integrand and (|a| + 


de. 


14 



























we obtain 


E( / sup (cr 0 (a;)’'’S 0 ^ 2 :) fn{x)dx 
lv„ e 


< 


E 


0 JVn 


a0{x)'^Y:Q^z 




fn{x)dxd6 



E 


+ ^02^'?+^ V 

Je Jv„ 

+ Ae229+2V [ [ E 

Je Jv„ 




dOi 


cTe{x)^^0^^^'^0^z\ {ae{x)^T,0^z 


fn{x)dxd6 

g+i\ 


fn{x)dxd9. 


Let A(0) be the Lebesgue measure of 0. Using the Cauchy-Schwarz inequality and letting 
Bqj^i be the positive constant so that, for X following a Gaussian distribution with zero mean, 
E(^ 2 (g+i)) ^ Bq+i{E{X‘^)Y+^, we obtain, by letting D = AeBq+i\{e)2^^+‘^, 


E( I sup (cr 0 (a;)’'’S 0 ^z) fnix)dx 


< ^e5q+iA(0)supE‘'+^ ( (ae{x)''''Sg^z 

x,e 


(9) 


+ L» ^ sup \ Ei+^ 


1=1 
9 


daeixy 

OB, 


x,6 


S^^z) ) supWE''+^ ((cr0(a;)TS0^z)^ 


+ D^supWE''+^ ((^aeixy-Eg^^^'Eg^z'^ ^ sup Je‘^+^ [{aeixY'Eg^zf 

^CC ^ ^ y \ ^ / X ^ ^ V 


Now, all the E'^^^(-) above are of the form E'^'''^([u; 0 (a;)'''M 0 z]^). Furthermore, Mg is 
symmetric and satishes, by using Condition and Lemma [§ sup 0 pi(M 0 ) < C for a finite 
constant C. Finally, for i = k{n — 1) + a, with k = 1,... ,p and a = 1,... ,n, sup^ \ wg{x)i\ < 
G/{1 + la: — a;a|'^'’""), for a finite constant G. Hence, 


supE([uj 0 (a;)’'’M 0 z]^) = sup t(; 0 (a;)’'’M 05 ] 0 QM 6 »i(; 6 »(®) 

x,6 x,0 

< sup||it:0(a;)||2C'^suppi(S0j, 

x,6 0 

which is bounded because of Lemmas and Hence, in Q, Ui = Op(l). Let us now turn to 
C/ 2 . Using the Sobolev embedding theorem again with the constant He, we obtain 


E(C/ 2 ) < Ae E 

Je JVn 


crg{x)'^'Eg^z - kg{x)'^Kg^z 


1 2 


q+E 


+Ae^ 


E 


i=l “'® 

q 

AqIq + He ^ li 
i=l 


_d_ 


cTe{x)'^'Eg^z - ke{x )' 


Tt^-1, 

0 


fn{x)dxd6 

g+i\ 


1 2 


fn{x)dxd6 


In the above display, we only show that the integrals Ii,... ,Iq converge to 0, since it is more 
difficult than for the integral Iq. Hence let us fix an integer i in {1,..., q}. Using Cauchy-Schwarz 


15 
























inequality, we have 


li < Ae\{Q)2^"^^s\ivsj^^-^[(Te{xy'^0^z-ke{xy^g^z\ 


X sup WE 

x,e V 


(Te{xyT.g^z - keixyKg^z 


2(g+i)\ 


Again, both of the supremums of square roots in the above display go to 0 as n —>■ oo and we 
show it only for the first one, since it is more difficult than for the second one. Using the positive 
constant i?q+i used before Q, it is sufficient to show that 


goes to 0 as n —)• oo. Then, we use 


ae{xyT,Q^z - keix^'K^^z 


(oii — 022 )^ < 2 [(oii — 021 )^ + (021 — 022 )^] 


and 


(^1111 — &2222)^ < 4 [(61111 — 62111)^ + (62111 — 62211)^ + (62211 — 62221)^ + (62221 — 62222)^] 


where subscripts 1 and 2 denote “untapered” and “tapered” and where for example 021 = 
{[dke{x)]/[d9i\yz and 62211 = ke{xy{[dTie\/[d9i\]TiQ^z. From this, it is sufficient 
to show that a generic term of the form 


sup E 

9,x 



we{x)yM.ez 



( 10 ) 


or 


supE 

6,x 


supE 

0,x 


meixyMgi'S^^ -Kg^)lSigz 
(dUg 5K0\ 


1 2 


mg{x) Me 


\ d9i d9i J 




( 11 ) 


( 12 ) 


goes to 0. In ([^, @ and @, supepi(Me) and supep^Ne) are bounded (Condition]^ 
and Lemma [^; vg{x) — wg(x) = crg(x) — kg(x) or vg{x) — wg{x) = {d<jg{x))/{d9i) — 
{dkg{x))/{d9i)-, and mg{x) = kg{x) or mg{x) = {[dkg{x)]/[d9i]}. 

Let us now show that a generic term of the form (10) goes to 0. We have 


supE 

6,x 


(vg(x) 


Wg(x)yMgZ 



= SUp(vg(x) - Wg(x}yMgEg^Mg'''(vg(x) - Wg{x)) 
6,x 

< suppi(MeS0gM0’'’)sup||i»0(a;) - wg{x)y, 

6 6,x 


which goes to 0 as n —)■ 00 by remembering that supg pi(M 0 ) is bounded and by using Lemmas]^ 

andm 


16 





















For a generic term of the form (|11[), we have 


sup E 

9,x 




1 2 


= supE 

6,x 


m0{x)'^M0Kg^ (Ke - 


1 2 


(13) 


= sup m 0 {xfM 0 K-^ (Ke - S^) ^ (K^ - K-^M 0 '^m 0 {x) 


6,x 


<snp\\m0{x)\\^pi{M0ypi{N0ypi{^-^ypi{K-^ypi{'S0,)pi{K0-'S0y. 

0,x 

In the above display, supg ,^, ||m 0 (a;)|p is bounded because of Lemma Furthermore all the 
Pi{-y, except the last one are bounded uniformly in 0 , by remembering that sup^ pi(M 0 ) and 
sup 0 /?i(N 0 ) are bounded, and because of Conditionand Lemmaj^ Finally sup^ pi(K 0 — 
goes to 0 as n —?• oo because of Lemma Hence a generic term of the form © goes to 0 as 


n —)• oo. Finally, by the same arguments as following (13), we show that a generic term of the 
form (12) goes to 0 as n —)> oo. Hence, E(C/ 2 ) in ([^ goes to 0 as n —)> oo which concludes the 
proof. □ 


Technical results 


The following lemma is a generalization of Lemma D.l in Bachoc (2014b). 


Lemma 4. Let A > 0 and a > 0 be fixed. Let f{x-, 6) he a family of functions: - 
for all 0 G 0, \ f{x; 0)| < 1/(1 + |a:;|'^+“). Then, for any m G N+, v G si, .., Sm G 
for any i y j |sj — Sj| > A, we have 


so that 
so that 


^ pn2d +°° 

supj^|/(s,-r;;0)| < 




-1 


e 


2=1 


A-^ ^ 1 + (A: - 1)'^+^ 


where the right-hand term in the above display is a finite constant depending only on d, A and 

a. 


Proof of Lemma^ By assumption on f{x, 0) we have 


sup 


^|/(si - v,e)\ < 


1 


2=1 


2 = 1 


1 + isj — vy +°‘' 


Let, for A: > 1, be the number of points Sj in Ek = {w, |u; — r;| < A:}\{u;; \w — v\ < k — 1}. 
Then, to the points sj that are in we can associate disjoint | • |-balls in E^ so that 
each of them has volume (A/2)'^ (recall |a| = max; |a;|). The total volume occupied by these 
balls is Nk{A/2y. On the other hand, the volume of Ek is 

{2ky - {2k - 2y = 2'^ du'^-^du < 2^dk^-\ 

Jk-l 

So we have Nk < d2^'^k^~^/. The result is then obtained by noting that for Sj G Ek, 

\sj — v\ > k — 1. □ 


The following lemma is a generalization of Lemma D.3 in Bachoc (2014b). 
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Lemma 5. Consider the setting of Lemma^ Then, for any N G N"*", for any m G N"*", v G 
si,Sm G so that for any i j |sj — Sj| > A, we have 


sup 

e 


E 


\f{si - v,e)\ < 


d2 


2d +“ 

E 




-1 


Arf 1 + (/j _ l)rf+a’ 

k=N ^ ' 


i=\^...fra\\si — v\>N—\ 

where the right-hand term in the above display is a function of N, d, A and a only, that goes 
to 0 as N ^ +CX) and for fixed d, A, a. 


Proof of Lemma^ The lemma is obtained by the proof of Lemma by noting that only the 
points Sj that are in Ek for k > N give a non-zero contribution to the sum in the left-hand side 
of the display in the lemma. □ 


Lemma 6. Assume that Condition^holds. Let fkfix-,0), k,l = I,... ,p bep^ functions: —>■ M 
so that for all 0 £ Q, \ fki{x-, 0)| < l/(l-|-|a;|'^'''“) and fki{x] 6) = fik{—x; 6). Let be the npxnp 
matrix defined by, for i = {k — l)n+a and j = {l — l)n+b, with k,l = 1,... ,p and a,b = 1,... ,n, 
foij = fkiixa — Xb] 0). Then, there exists a constant A < oo so that for any n, 0, pi(F 0 ) < A. 

Proof of Lemma^ Since Fg is symmetric, pifFg) = Ai(F 0 ). Hence, because of Gershgorin 
circle theorem and of |/ 6 »fcfc| < 1 for any n, 9, it is sufficient to show that 

sup 

i.n.O -1 

is finite. By writing the sum above as the sum of p subsums, it is sufficient to show that 

sup , E \fkfiXa Xj] 0)1 

k,l,a,n,e 

is finite. This is true because of Lemma [H □ 


Lemma 7. Assume that conditions 00 on(i[^ hold. Then, as n ^ oo 


sup 

i,0 


M 


Le 


= Op(l) 


and 


sup 

i,e 


M 


Le 


= Op{l). 


Proof of Lemma 0 We do the proof for Lg only since the proof for Lq is identical. We have for 
any i = 1 ,... ,g. 


sup 

6 »e 0 



= sup 
0e0 

< sup Pi 
e 


-tr S 


np 


_id'Eg\ 1 




dOi J 


- 2 : 

® dOi ® 


np 


1/2 5S0 1/2 

^ de. ^ 


- z sup Pi ( S 


np 


de, 


Now, {l/{np))z'^z is bounded in probability since it is positive with constant mean value 
(1/p) X]fc=i CfcA:(0; ^o)- The two pi(-) in the above display are bounded uniformly in 9 because 
of Pi (CD) < pi(C)pi(D), of Conditions 00 and 0 and of Lemma 0 □ 


Lemma 8. Let a > 0 and A > 0 be fixed. Let f{x-, 9) be a family of functions: —)• M so that 
for all 9, \f{x;9)\ < 1/(1 -|- |a;|‘^+"). Let t{x) be a fixed function: —)> M that is continuous at 

0 and so that t(0) = 1 and |t(®)| < 1. Let Sm be the set of all sets of points (si,..., Sm) so that 
for i j |sj — Sjl > A. Then, 

m 

sup d) - f{v - Si; 9)t{{v - Si)/j) I 

m,{si,...,Sm)eSm,V,0 

goes to 0 as 'y ^ 00 . 
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Proof of Lemma\^ Let e > 0 be fixed. Because of Lemmawe can find M G N"*" so that 
sup ^ \f{v - Si-.O) - f{v - Si]e)t{{v - Si)/-^)\<e. 

Because t is continuous at 0, we have for 7 large enough and for |d — Sj| < M — 1 

|l -t((si - v)/j) \ < 


Nm-\ 


where Nm-i is the maximum numbers of points Sj so that — n| < M — 1, over all possible m, 
V and (si, ..., Sm) £ Sm- Putting the two bounds together, and using \f{x] d)\ < 1 we obtain, 
for 7 large enough, 

m 

sup ^ \f{v - Si;e) - f{v - Si;e)t{{v - Si)/j) \ < e + NM-i^ —, 


which finishes the proof. □ 

Lemma 9. Assume that Conditions^ and^ hold. Let fki{x]0) andFg be as in Lemma^ Let 
tki{x), k,l = 1,... ,p, be the taper functions satisfying Condition^ Let 7 be the taper range, 
also satisfying Condition^ Let Gg be the np x np matrix defined by, for i = (k — l)n + a and 
j = {I - l)n + b, with k,l = l,...,p and a,b = l,...,n, ggij = fki{xa - Xb\0)tki{{xa - Xb)h). 
Then, sup0/9i(F0 - Gg) -^n^oo 0. 

Proof of Lemma\^ The lemma is a consequence of LemmaThe proof is based on Gershgorin 
circle theorem as for the proof of Lemma □ 

Lemma 10. Assume that Conditions andj^/io/d. Then, sup^ ^||S 0 —K 0 |||. goes to 0 as 
n —>■ 00. 


Proof of Lemma 10. The lemma is a consequence of Lemma 


□ 
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